Adding MultiModal ShardedDataloader #262

Open · wants to merge 1 commit into main

Conversation

@ashvinnihalani commented Nov 19, 2024

Description

Adds a MultiModal ShardedDataloader and several of the samplers to be used with it.

Additional context

ShardedDataloader is a flexible multimodal dataloader that targets the large-scale LLM/VLM training use case. A single backend instance of the S3 dataloader is not sufficient on its own; it will face the following issues, which this dataloader attempts to address (a minimal sketch of the sharding idea follows the list below):

  • Pre-download times: Other dataloader implementations often require you to pre-download part or all of the dataset. Even dataset implementations like IndexedDataset require you to generate index files, either ahead of time or at runtime, which is expensive.
  • Fast access: Mmap-backed datasets are often infeasible when a customer wants to combine many separate datasets. Without mmap, access times are slow; with mmap, you can hit the system's open-file limit.
  • Flexibility: Dataloaders sometimes trade flexibility for performance by pre-tokenizing/pre-embedding multimodal components. This doesn't scale to complex multimodal types like video, and it often requires individuals to reprocess their data every time a data-prep step changes.

  • I have updated the CHANGELOG or README if appropriate

Related items

Testing


By submitting this pull request, I confirm that my contribution is made under the terms of the BSD 3-Clause License and I agree to the terms of the LICENSE.

Comment on lines +325 to +326
sub_shards.extend(shards[i])
if len(sub_shards) == num_uri_merge:
Contributor:

If we are extending sub_shards with a list, isn't it possible that the size of sub_shards exceeds num_uri_merge? Then the condition will be false, and we will continue to expand sub_shards.

class S3ShardSampler(ShardSampler, pl.core.hooks.CheckpointHooks):
    def __init__(self, uri: str, glob: Optional[str] = None, recursive: bool = True, num_uri_merge: int = 0):
        s3_client = self._get_client()
        self.shards: List[str] = list(get_objects_from_uris(uri, s3_client)  # type: ignore
Contributor:

Isn't it missing a closing parenthesis?

Contributor:

Actually, what is the idea behind that line? uri represents a single object, so get_objects_from_uris will return one S3BucketKeyData, i.e. the passed uri parsed into bucket and key. As I understand it, self.shards is not expected to hold S3BucketKeyData.
